(57) Abstract: A method for transcoding a CELP based compressed voice bitstream from source codec to destination codec. The
method includes processing a source codec input CELP bitstream to unpack at least one or more CELP parameters from the input
CELP bitstream and interpolating one or more of the plurality of unpacked CELP parameters from a source codec format to a
destination codec format if a difference of one or more of a plurality of destination codec parameters including a frame size, a subframe
size, and/or sampling rate of the destination codec format and one or more of a plurality of source codec parameters including a
frame size, a subframe size, or sampling rate of the source codec format exist. The method includes encoding the one or more
CELP parameters for the destination codec and processing a destination CELP bitstream by at least packing the one or more CELP
parameters for the destination codec.
A TRANSCODING SCHEME BETWEEN CELP-BASED SPEECH CODES



CROSS-REFERENCES TO RELATED APPLICATIONS 
[0001] The present application claims priority to U.S. Provisional Applications
60/347,270, filed January 8, 2002, 60/364,403, filed March 12, 2002, 60/421,446, filed
October 25, 2002, 60/421,449, filed October 25, 2002, and 60/421,270, filed October 25,
2002, commonly owned, and hereby incorporated by reference for all purposes.

STATEMENT AS TO RIGHTS TO INVENTIONS MADE UNDER 
FEDERALLY SPONSORED RESEARCH OR DEVELOPMENT 
[0002] NOT APPLICABLE

REFERENCE TO A "SEQUENCE LISTING," A TABLE, OR A COMPUTER 
PROGRAM LISTING APPENDIX SUBMITTED ON A COMPACT DISK. 
[0003] NOT APPLICABLE

BACKGROUND OF THE INVENTION 
[0004] The present invention generally relates to techniques for processing information.
More particularly, the invention provides a method and apparatus for converting CELP 
frames from one CELP based standard to another CELP based standard, and/or within a 
single standard but a different mode. Further details of the present invention are provided 
throughout the present specification and more particularly below. 
[0005] Coding is the process of converting a raw signal (voice, image, video, etc.) into a
format amenable for transmission or storage. The coding usually results in a large amount of
compression, but generally involves significant signal processing to achieve. The outcome of
the coding is a bitstream (sequence of frames) of encoded parameters according to a given
compression format. The compression is achieved by removing statistically and perceptually
redundant information using various techniques for modeling the signal. Hence the encoded
format is referred to as a "compression format" or "parameter space". The decoder takes the
compressed bitstream and regenerates the original signal. In the case of speech coding,
compression typically leads to information loss.

[0006] The process of converting between different compression formats and/or reducing
the bit rate of a previously encoded signal is known as transcoding. This may be done to
conserve bandwidth, or connect incompatible clients and/or server devices. Transcoding
differs from the direct compression process in that a transcoder only has access to the
compressed signal and does not have access to the original signal.

[0007] Transcoding can be done using brute force techniques such as "tandem" transcoding, which has a
decompression process followed by a re-compression process. Since a large amount of
processing is often required and delays may be incurred to decompress and then re-compress
a signal, one can consider transcoding in the compression space or parameter space. Such
transcoding aims at mapping between compression formats while remaining in the parameter
space wherever possible. This is where the sophisticated algorithms of "smart" transcoding
come into play. Although there have been advances in transcoding, it is desirable to further
improve transcoding techniques. Further details of limitations of conventional techniques
will be described more fully throughout the present specification and more particularly
below.

BRIEF SUMMARY OF THE INVENTION

[0008] According to the present invention, techniques for processing information are
provided. More particularly, the invention provides a method and apparatus for converting
CELP frames from one CELP based standard to another CELP based standard, and/or within
a single standard but a different mode. Further details of the present invention are provided
throughout the present specification and more particularly below.

[0009] In a specific embodiment, the invention provides an apparatus for converting CELP
frames from one CELP-based standard to another CELP based standard, and/or within a
single standard but to a different mode. The apparatus has a bitstream unpacking module for
extracting one or more CELP parameters from a source codec. The apparatus also has an
interpolator module coupled to the bitstream unpacking module. The interpolator module is
adapted to interpolate between different frame sizes, subframe sizes, and/or sampling rates of
the source codec and a destination codec. A mapping module is coupled to the interpolator
module. The mapping module is adapted to map the one or more CELP parameters from the
source codec to one or more CELP parameters of the destination codec. The apparatus has a
destination bitstream packing module coupled to the mapping module. The destination
bitstream packing module is adapted to construct at least one destination output CELP frame
based upon at least the one or more CELP parameters from the destination codec. A
controller is coupled to at least the destination bitstream packing module, the mapping
module, the interpolator module, and the bitstream unpacking module. Preferably, the
controller is adapted to oversee operation of one or more of the modules and to
receive instructions from one or more external applications. The controller is adapted to
provide status information to one or more of the external applications.

[0010] In an alternative specific embodiment, the invention provides a method for
transcoding a CELP based compressed voice bitstream from source codec to destination
codec. The method includes processing a source codec input CELP bitstream to unpack at
least one or more CELP parameters from the input CELP bitstream and interpolating one or
more of the plurality of unpacked CELP parameters from a source codec format to a
destination codec format if a difference exists between one or more of a plurality of destination codec
parameters including a frame size, a subframe size, and/or sampling rate of the destination
codec format and one or more of a plurality of source codec parameters including a frame
size, a subframe size, or sampling rate of the source codec format. The method includes
encoding the one or more CELP parameters for the destination codec and processing a
destination CELP bitstream by at least packing the one or more CELP parameters for the
destination codec.

[0011] In an alternative specific embodiment, the invention provides a method for
processing CELP based compressed voice bitstreams from source codec to destination codec
formats. The method includes transferring a control signal from a plurality of control signals
from an application process and selecting one CELP mapping strategy from a plurality of
different CELP mapping strategies based upon at least the control signal from the application.
The method also includes performing a mapping process using the selected CELP mapping
strategy to map one or more CELP parameters from a source codec format to one or more
CELP parameters of a destination codec format.

[0012] Still further, the invention provides a system for processing CELP based
compressed voice bitstreams from source codec to destination codec formats. The system
includes one or more memories. Such memories may include one or more codes for
receiving a control signal from a plurality of control signals from an application process. One
or more codes for selecting one CELP mapping strategy from a plurality of different CELP
mapping strategies based upon at least the control signal from the application are also
included. The one or more memories also include one or more codes for performing a
mapping process using the selected CELP mapping strategy to map one or more CELP
parameters from a source codec format to one or more CELP parameters of a destination
codec format. Depending upon the embodiment, there may also be other computer codes for
carrying out the functionality described herein, as well as outside of this specification, which
may be combined with the present invention.

[0013] Numerous benefits are achieved using the present invention. Depending upon the
embodiment, one or more of these benefits may be achieved.

• To reduce the computational complexity of the transcoding process.

• To reduce the delay through the transcoding process.

• To reduce the amount of memory required by the transcoding.

• To introduce dynamic rate control.

• To support silence frames through an embedded voice activity detector.

• To provide a framework where various parameter mapping strategies can
be used.

• To provide a generic transcoding architecture to adapt to the current and
future diversity of CELP based codecs.

[0014] The transcoding invention may achieve one or more of these benefits. In a specific
embodiment, the transcoding apparatus includes:

• a source CELP parameter unpacking module that extracts CELP
parameters from the input encoded CELP bitstream;

• a CELP parameter interpolator that converts the input source CELP
parameters into destination CELP parameters corresponding to the subframe
size difference between source and destination codec; parameter interpolation
is used if the subframe sizes of the source and destination codecs are different;

• a destination CELP parameter mapping and tuning engine that converts
CELP parameters from the said interpolator module into the destination CELP
codec parameters;

• a destination CELP codes packer that packs the mapped CELP parameters
into destination CELP code frames;

• an advanced feature manager that manages optional functions and features
in CELP-to-CELP transcoding;

• a controller that oversees the overall transcoding process;

• a status reporting function that provides the status of the transcoding
process.

[0015] The source CELP parameter unpacking module is a simplified CELP decoder 
without a formant filter and a post-filter. 
[0016] The CELP parameter interpolator comprises a set of interpolators related to one
or more of the CELP parameters.

[0017] The destination CELP parameter mapping and tuning module includes a parameter
mapping strategy switching module, and one or more of the following parameter mapping
strategies: a module of CELP parameter direct space mapping, a module of analysis in
excitation space mapping, a module of analysis in filtered excitation space mapping.

[0018] The invention performs transcoding on a subframe by subframe basis. That is, as a
frame (of source compressed information) is received by the transcoding system, the
transcoder can begin operating on it and producing output subframes. Once a sufficient
number of subframes have been produced, a frame (of compressed information according to
destination format) can be generated and can be sent to the communication channel if
communication is the purpose. If storage is the purpose, the generated frame can be stored as
desired. If the durations of the frames defined by the source and destination format standards
are the same, then a single incoming frame will produce a single outgoing frame; otherwise
buffering of input frames, or generation of multiple output frames, will be needed. If
the subframes are of different durations, then interpolation between the subframe parameters
will be required. Thus the transcoding operation consists of four operations: (1) bitstream
unpacking, (2) subframe buffering and interpolation of source CELP parameters, (3) mapping
and tuning to destination CELP parameters, and (4) code packing to produce output frame(s).

[0019] So on receipt of a frame, the transcoder unpacks the bitstream to produce the CELP
parameters for each of the subframes contained within the frame (Figure 10, block (1)). The
parameters of interest are the LPC coefficients, the excitation (produced from the adaptive
and fixed codewords), and the pitch lag. Note that for a low complexity solution that
produces good quality, only decoding to the excitation is required and not full synthesis of the
speech waveform. If subframe interpolation is needed, it is done at this point by the smart
interpolation engine (Figure 10, block (2)).

[0020] The subframes are now in a form amenable for processing by the destination
parameter mapping and tuning module (Figure 10, block (5)). The short-term LPC filter
coefficients are mapped independently of the excitation CELP parameters. Simple linear
mapping in the LSP pseudo-frequency space can be used to produce the LSP coefficients for
the destination codec. The excitation CELP parameters can be mapped in a number of ways
giving accordingly better quality output at the cost of computational complexity. Three such
mapping strategies have been described in this document and are part of the Parameter
Mapping & Tuning Strategies module (Figure 10, block (4)):

• CELP parameter Direct Space Mapping (DSM);

• Analysis in excitation space domain;

• Analysis in filtered excitation space domain.

The selection of the mapping and tuning strategy is through the Mapping & Tuning Strategy
Switching Module (Figure 10, block (3)).

[0021] Since the three methods trade off quality for reduced computational load, they can
be used to provide graceful degradation in quality in the case of the apparatus being
overloaded by a large number of simultaneous channels. Thus the performance of the
transcoder can adapt to the available resources. Alternatively, a transcoding system may be
built using one strategy only, yielding a desired quality and performance. In such a case, the
Mapping and Tuning Strategy Switching module (Figure 10, block (3)) would not be
incorporated.

[0022] A voice activity detector (operating in the parameter space) can also be employed at
this point, if applicable to the destination standard, to reduce the outbound bandwidth.

[0023] The mapped parameters can then be packed into destination bitstream format frames
(Figure 10, block (7)) and generated for transmission or storage.

[0024] The invention covers the algorithms and methods used to perform smart transcoding
between CELP-based speech coding standards. The invention also covers transcoding within
a single standard in order to perform rate control (by transcoding to lower modes or introducing
silence frames through an embedded Voice Activity Detector).

[0025] The whole procedure of transcoding is overseen by a Control module (Figure 10,
block (8)) which sends commands based on the status of transcoding and external instructions.

[0026] In order to adapt to different transcoding requirements, the apparatus of the present
invention provides the capability of adding optional features and functions (Figure 10,
block (6)).

[0027] Other features and advantages of the present invention will be apparent from the 
following description taken in conjunction with the accompanying drawing, in which like 
reference characters designate the same or similar parts throughout the figures thereof. 

BRIEF DESCRIPTION OF THE DRAWINGS
[0028] The objects, features, and advantages of the present invention, which are believed to
be novel, are set forth with particularity in the appended claims. The present invention, both
as to its organization and manner of operation, together with further objects and advantages,
may best be understood by reference to the following description, taken in connection with
the accompanying drawings.

[0029] FIG. 1 is a simplified block diagram of the decoder stage of a generic CELP coder;

[0030] FIG. 2 is a simplified block diagram of the encoder stage of a generic CELP coder;

[0031] FIG. 3 is a simplified block diagram showing a mathematical model of a codec;

[0032] FIG. 4 is a simplified block diagram showing a mathematical model of a tandem
transcodec;

[0033] FIG. 5 is a simplified block diagram showing a mathematical model of a smart
transcodec;

[0034] FIG. 6 is an illustration of one of the traditional apparatus for CELP based
transcoding;

[0035] FIG. 7 is an illustration of one of the traditional apparatus for CELP based
transcoding;

[0036] FIG. 8 is a simplified block diagram showing generic transcoding between CELP
codecs;

[0037] FIG. 9 is a simplified diagram showing subframe interpolation for GSM-AMR and
G.723.1;

[0038] FIG. 10 depicts a simplified block diagram of a system constructed in accordance
with an embodiment of the present invention to transcode an input CELP bitstream from a
source CELP codec to an output CELP bitstream of a destination codec;

[0039] FIG. 11 is a simplified block diagram of a source codec CELP parameters unpack
module in greater detail;

[0040] FIG. 12 is a simplified diagram showing interpolation of subframe and sample-by-sample
parameters for G.723.1 to GSM-AMR;

[0041] FIG. 13 is a simplified block diagram showing the excitation being calibrated by
source codec LPC coefficients and destination codec encoded LPC coefficients;

[0042] FIG. 14 is a simplified block diagram showing Parameter Mapping & Tuning
Module for CELP parameter mapping in greater detail;

[0043] FIG. 15 is a simplified block diagram of a destination CELP parameters tuning
module in greater detail;

[0044] FIG. 16 is a simplified diagram showing an embodiment of the destination CELP
code packing in frames for GSM-AMR;

[0045] FIG. 17 depicts an embodiment of a G.723.1 to GSM-AMR transcoder; and

[0046] FIG. 18 depicts an embodiment of a GSM-AMR to G.723.1 transcoder.
DETAILED DESCRIPTION OF THE INVENTION

[0047] According to the present invention, techniques for processing information are
provided. More particularly, the invention provides a method and apparatus for converting
CELP frames from one CELP based standard to another CELP based standard, and/or within
a single standard but a different mode. Further details of the present invention are provided
throughout the present specification and more particularly below.

[0048] The invention covers algorithms and methods used to perform smart transcoding
between CELP (code excited linear prediction) based coding methods and standards. Of
most interest are the CELP coding methods standardized by bodies such as the International
Telecommunication Union (ITU) or the European Telecommunications Standards Institute
(ETSI). The invention also covers transcoding within a single standard in order to perform
rate control (by transcoding to lower modes or introducing silence frames through an embedded
Voice Activity Detector).

[0049] Speech coding techniques in general can be classified as waveform coders (e.g.
standards G.711, G.726, G.722 from the ITU) and analysis-by-synthesis (AbS) type of coders
(e.g. G.723.1 and G.729 standards from the ITU, GSM-AMR standard from ETSI, and
Enhanced Variable-Rate Codec (EVRC), Selectable Mode Vocoder (SMV) standards from
the Telecommunication Industry Association (TIA)). Waveform coders operate in the time
domain and are based on a sample-by-sample approach that utilizes the correlation
between speech samples. Analysis-by-synthesis coders try to imitate the human speech
production system by a simplified model of a source (glottis) and a filter (vocal tract) that
shapes the output speech spectrum on a frame basis (typically a frame size of 10-30 ms is used).

[0050] The analysis-by-synthesis types of coders were introduced to provide high quality
speech at low bit rates, at the expense of increased computational requirements.
Compression techniques are an effective way to conserve resources on the communication
interface.

[0051] Mathematically, all speech codecs start with a one-dimensional analog speech
signal, $x_a(t)$, which is uniformly sampled and quantised to get a digital domain
representation, $x(n) = Q(x_a(nT))$. The sampling rate, $f_s = 1/T$, for speech signals is normally
either 8 kHz or 16 kHz, and the sampled signal is quantised to a maximum typically of 16 bits.
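
As a minimal illustration of the relation $x(n) = Q(x_a(nT))$, the following Python sketch samples a continuous-time function at 8 kHz and quantises it to 16-bit signed integers. The function names, the 1 kHz test tone, and the assumption that the analog signal is normalised to the range [-1, 1] are illustrative choices and are not taken from the specification.

```python
import math

def sample_and_quantise(x_a, fs_hz, n_samples, bits=16):
    """Return x(n) = Q(x_a(n*T)) for n = 0..n_samples-1, with T = 1/fs_hz."""
    T = 1.0 / fs_hz                       # sampling period
    full_scale = 2 ** (bits - 1) - 1      # 32767 for 16-bit signed samples
    samples = []
    for n in range(n_samples):
        v = x_a(n * T)                    # analog value, assumed within [-1, 1]
        v = max(-1.0, min(1.0, v))        # clip anything outside the assumed range
        samples.append(int(round(v * full_scale)))
    return samples

def tone(t):
    """Hypothetical 1 kHz test tone standing in for the analog speech signal."""
    return math.sin(2.0 * math.pi * 1000.0 * t)

# Eight samples cover exactly one period of the 1 kHz tone at fs = 8 kHz.
x = sample_and_quantise(tone, fs_hz=8000, n_samples=8)
```

At 8 kHz the sampling period is 125 microseconds, so a 20 ms or 30 ms codec frame holds 160 or 240 samples respectively.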
[0052] A CELP-based codec can then be thought of as an algorithm which maps between
the sampled speech, $x(n)$, and some parameter space, $\theta$, using a model of speech production,
i.e. it encodes and decodes the digital speech. All CELP-based algorithms operate on frames
of speech (which may be further divided into several subframes). In some codecs the speech
frames overlap each other. A frame of speech can be defined as a vector of speech samples
beginning at some time $n$, that is,

$$\bar{x}_i = [\, x(n) \;\; x(n+1) \;\; \cdots \;\; x(n+L-1) \,]^T$$

where $L$ is the length (number of samples) of the speech frame. Note that the frame index, $i$,
is related to the first frame sample $n$ by a linear relationship,

$$n = \begin{cases} iL & \text{for non-overlapping frames} \\ i(L-K) & \text{for overlapping frames,} \end{cases}$$

where $K$ is the number of samples overlapped between frames.
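
The relationship between the frame index and the first sample of each frame can be sketched in a few lines of Python. The helper names `frame_start` and `split_frames` are hypothetical; the code only restates the definitions above (frame length L, overlap K) and drops any trailing samples that do not fill a complete frame.

```python
def frame_start(i, L, K=0):
    """First sample index n of frame i: i*L without overlap, i*(L-K) with K overlapping samples."""
    if not 0 <= K < L:
        raise ValueError("overlap K must satisfy 0 <= K < L")
    return i * (L - K)

def split_frames(x, L, K=0):
    """Return every complete frame [x(n), ..., x(n+L-1)] contained in the sample list x."""
    frames = []
    i = 0
    while frame_start(i, L, K) + L <= len(x):
        n = frame_start(i, L, K)
        frames.append(x[n:n + L])
        i += 1
    return frames

samples = list(range(10))
non_overlapping = split_frames(samples, L=4)        # frames start at n = 0, 4
overlapping = split_frames(samples, L=4, K=2)       # frames start at n = 0, 2, 4, 6
```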

[0053] Now the compression (lossy encoding) process is a function which maps the speech
frames, $\bar{x}_i$, to parameters, $\theta_i$, and the decoding process maps back from the parameters, $\theta_i$,
to an approximation of the original speech frames, $\hat{x}_i$. The speech frames that are produced
by the decoder are not identical to the speech frames that were originally encoded. The codec
is designed to produce output speech which is as perceptually similar as possible to the input
speech; that is, the encoder must produce parameters which maximize some perceptual
similarity measure between the input speech frames and the frames produced by the decoder when
processing the parameters.

[0054] In general the mapping from input to parameters, and from parameters to output,
requires knowledge of all previous input or parameters. This can be achieved by maintaining
state within the codec, $S$, for example in the construction of the adaptive codebook used by
CELP based methods. The encoder state and decoder state must remain synchronized. This
is achieved by only updating the state based on data which both sides (encoder and decoder)
have, i.e. the parameters. Figure 3 shows a generic model of an encoder, channel, and
decoder.

[0055] The frame parameters, $\theta_i$, used in CELP-based models consist of the linear-predictive
coefficients (LPCs) used for short-term prediction of the speech signal (and
physically relating to the vocal tract, mouth and nasal cavity, and lips), as well as an excitation
signal composed from adaptive and fixed codes. The adaptive codes are used to model long-term
pitch information in the speech. The codes (adaptive and fixed) have associated
codebooks that are predefined for a specific CELP codec. Figure 1 shows a typical CELP
decoder where the adaptive and fixed codebook vectors are scaled independently by a gain
factor, then combined and filtered to produce synthesized speech. This speech is usually
passed through a post-filter to remove artifacts introduced by the model.
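
The decoder structure just described (gain-scaled adaptive and fixed vectors summed into an excitation, then passed through the short-term filter) can be sketched as follows. This is a simplified illustration, not the decoder of any particular standard: it assumes an integer pitch lag, builds the adaptive vector by periodic repetition of past excitation when the lag is shorter than the subframe, and uses the convention A(z) = 1 + a_1 z^-1 + ... + a_p z^-p for the LPC polynomial. All function names are hypothetical.

```python
def adaptive_vector(past_exc, lag, n):
    """Adaptive codebook vector: past excitation delayed by an integer pitch lag."""
    if not 1 <= lag <= len(past_exc):
        raise ValueError("lag must lie within the stored excitation history")
    buf = list(past_exc)
    vec = []
    for _ in range(n):
        s = buf[-lag]          # sample one pitch period back
        vec.append(s)
        buf.append(s)          # periodic extension when lag < n (simplifying assumption)
    return vec

def build_excitation(past_exc, lag, fixed_vec, g_adaptive, g_fixed):
    """Excitation = g_adaptive * adaptive vector + g_fixed * fixed codebook vector."""
    adaptive = adaptive_vector(past_exc, lag, len(fixed_vec))
    return [g_adaptive * a + g_fixed * c for a, c in zip(adaptive, fixed_vec)]

def lpc_synthesis(exc, a, memory=None):
    """All-pole synthesis 1/A(z): s(n) = e(n) - sum_k a[k-1] * s(n-k)."""
    p = len(a)
    hist = list(memory) if memory is not None else [0.0] * p   # most recent sample last
    out = []
    for e in exc:
        s = e - sum(a[k] * hist[-1 - k] for k in range(p))
        out.append(s)
        hist.append(s)
    return out

exc = build_excitation([1.0, 0.0, 0.0], lag=3, fixed_vec=[0.0, 1.0, 0.0, 0.0],
                       g_adaptive=0.5, g_fixed=2.0)
speech = lpc_synthesis(exc, a=[-0.5])
```

A transcoder that works in the parameter space would stop after `build_excitation`; the call to `lpc_synthesis` and any post-filter are only needed when the speech waveform itself is required.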
[0056] The CELP encoding (analysis) process, shown in Figure 2, involves preprocessing
of the speech signal to remove unwanted frequency components and application of a
windowing function, followed by extraction of the short-term LPC parameters. This is
typically done using the Levinson-Durbin algorithm. The LPC parameters are converted into
Line Spectral Pairs (LSPs) to facilitate quantization and subframe interpolation. The speech
is then inverse-filtered by the short-term LPC filter to produce a residual excitation signal.
This residual is perceptually weighted to improve quality and is analysed to find an estimate
of the pitch of the speech. A closed-loop analysis-by-synthesis method is used to determine
the optimal pitch. Once the pitch is found the adaptive codebook component of the excitation
is subtracted from the residual, and the optimal fixed codeword found. The internal memory
of the encoder is updated to reflect changes to the codec state (such as the adaptive
codebook).
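
For readers unfamiliar with the Levinson-Durbin algorithm mentioned above, the following Python sketch shows the textbook recursion that derives LPC coefficients from autocorrelation values. It is a generic illustration, not the fixed-point routine of any standard codec, and it omits the windowing, lag windowing, and bandwidth expansion steps that real encoders apply. The coefficient convention is A(z) = 1 + a_1 z^-1 + ... + a_p z^-p.

```python
def autocorrelation(x, order):
    """Autocorrelation values r[0..order] of the (already windowed) frame x."""
    return [sum(x[n] * x[n - k] for n in range(k, len(x))) for k in range(order + 1)]

def levinson_durbin(r, order):
    """Return (a, err): a[0] = 1 and a[1..order] are the LPC coefficients of A(z),
    err is the final prediction error energy."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:
            break                                   # degenerate (e.g. all-zero) input
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                              # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= (1.0 - k * k)
    return a, err

# Autocorrelation of an ideal first-order process x(n) = 0.5 x(n-1) + w(n), normalised to r[0] = 1.
a, err = levinson_durbin([1.0, 0.5, 0.25], order=2)
```

For this input the recursion recovers a first-order predictor (a_1 = -0.5, a_2 = 0), as expected for a first-order process.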

[0057] The simplest method of transcoding is a brute-force approach called tandem
transcoding, see Figure 4. This method performs a full decode of the incoming compressed
bits to produce synthesized speech. The synthesized speech is then encoded for the target
standard. This method suffers from the huge amount of computation required in re-encoding
the signal, as well as from quality degradation issues introduced by pre- and post-filtering of
the speech waveform, and from potential delays introduced by the look-ahead requirements
of the encoder.

[0058] Methods for "smart" transcoding similar to that illustrated in Figure 5 have
appeared in the literature. However these methods still essentially reconstruct the speech
signal and then perform significant work to extract the various CELP parameters such as LPC
and pitch. That is, these methods still operate in the speech signal space. In particular, the
excitation signal which has already been optimally matched to the original speech by the far-end
encoder (encoder at the far-end that has produced the compressed speech according to a
compression format) is only used for the generation of the synthesised speech. The
synthesised speech is then used to compute a new optimal excitation. Due to the requirement
of incorporating impulse response filtering operations in closed-loop searches, this becomes a
very computationally intensive operation. Figure 6 illustrates the method used by
US 6,260,009 B1. The reconstructed signal which is used as target signal by the Searcher is
produced from the input excitation parameters and output quantized formant filter
coefficients. Due to the differences between quantized formant filter coefficients in the
source and destination codecs, this leads to degradation in the target signal for the Searcher
and finally the output speech quality from the transcoding is significantly degraded. See
Figure 6. Other limitations may be found throughout the present specification and more
particularly below.

[0059] Another "smart" transcoding method, illustrated in Figure 7 (US 2002/0077812 A1),
has been published. This method performs transcoding by mapping each CELP
parameter directly, ignoring the interaction between the CELP parameters. The method is only
applicable for a special case that requires very restricted conditions between source and
destination CELP codecs. For example, it requires Algebraic CELP (ACELP) and the same
subframe size in both source and destination codecs. It does not produce good quality speech
for most CELP based transcoding. This method is only suitable for one of the GSM-AMR
modes and it does not cover all the modes in GSM-AMR.

[0060] A method and apparatus of the invention are discussed in detail below. In the
following description, for purposes of explanation, numerous specific details are set forth in
order to provide a thorough understanding of the present invention. The case of GSM-AMR
and G.723.1 are used for illustration purpose and for examples. The methods described here
are generic and apply to the transcoding between any pair of CELP codecs. A person skilled
in the relevant art will recognize that other steps, configurations and arrangements can be
used without departing from the spirit and scope of the present invention.

[0061] The invention covers the algorithms and methods used to perform smart transcoding
between CELP-based speech coding standards. The invention also covers transcoding within
a single standard in order to perform rate control (by transcoding to lower modes or introducing
silence frames through an embedded Voice Activity Detector). The following sections
discuss the details of the present invention.
[0062] The invention performs transcoding on a subframe by subframe basis. That is, as a
frame is received by the transcoding system, the transcoder can begin operating on its
subframes and producing output subframes. Once a sufficient number of subframes have
been produced, a frame can be generated. If the durations of the frames defined by the source
and destination standards are the same, then one input frame will produce one output frame;
otherwise buffering of input frames, or generation of multiple output frames, will be
needed. If the subframes are of different durations, then interpolation between the subframe
parameters will be required. Thus the transcoding operation consists of four operations: (1)
bitstream unpacking, (2) subframe buffering and interpolation of source CELP parameters,
(3) mapping and tuning to destination CELP parameters, and (4) code packing to produce
output frame(s) (see Figure 8).
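
The buffering behaviour described above can be illustrated with the two codecs used as examples in this specification. G.723.1 uses 30 ms frames of four 60-sample subframes and GSM-AMR uses 20 ms frames of four 40-sample subframes, both at 8 kHz, so two G.723.1 frames span the same 60 ms as three GSM-AMR frames. The Python sketch below re-blocks a sample-domain signal (such as the decoded excitation) from source subframes into destination subframes and frames. The class and function names are hypothetical, and the sketch covers only the buffering step, not the parameter interpolation or mapping.

```python
from math import gcd

def frames_per_cycle(src_frame_ms, dst_frame_ms):
    """Smallest (n_src, n_dst) so that n_src source frames last as long as n_dst destination frames."""
    lcm = src_frame_ms * dst_frame_ms // gcd(src_frame_ms, dst_frame_ms)
    return lcm // src_frame_ms, lcm // dst_frame_ms

class SubframeRebuffer:
    """Collects source subframes and emits destination frames once enough samples are buffered."""

    def __init__(self, dst_subframe_len, dst_subframes_per_frame):
        self.dst_subframe_len = dst_subframe_len
        self.dst_subframes_per_frame = dst_subframes_per_frame
        self._samples = []      # samples not yet assigned to a destination subframe
        self._subframes = []    # destination subframes not yet assigned to a frame

    def push(self, src_subframe):
        """Add one source subframe; return the list of destination frames completed by it."""
        self._samples.extend(src_subframe)
        while len(self._samples) >= self.dst_subframe_len:
            self._subframes.append(self._samples[:self.dst_subframe_len])
            del self._samples[:self.dst_subframe_len]
        frames = []
        while len(self._subframes) >= self.dst_subframes_per_frame:
            frames.append(self._subframes[:self.dst_subframes_per_frame])
            del self._subframes[:self.dst_subframes_per_frame]
        return frames

# G.723.1 (60-sample subframes) to GSM-AMR (40-sample subframes, 4 per frame).
rebuffer = SubframeRebuffer(dst_subframe_len=40, dst_subframes_per_frame=4)
completed = [len(rebuffer.push([0.0] * 60)) for _ in range(8)]   # two G.723.1 frames
```

In this example the first GSM-AMR frame becomes available after the third G.723.1 subframe has been processed, which is what allows output to begin before the whole source frame has been handled.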

[0063] Figure 10 is a block diagram illustrating the principles of a CELP based codec
transcoding apparatus according to the present invention. The apparatus comprises a source
bitstream unpacking module, a smart interpolation engine, a parameter mapping and tuning
module, an optional advanced features module, a control module, and a destination bitstream
packing module.

[0064] The parameter mapping & tuning module comprises a mapping & tuning strategy 
switching module and parameter mapping & tuning strategies module. 
[0065] The transcoding operation is overseen by the control module.

[0066] So on receipt of a frame, the transcoder unpacks the bitstream to produce the CELP
parameters for each of the subframes contained within the frame. The parameters of interest
are the LPC coefficients, the excitation (produced from the adaptive and fixed codewords),
and the pitch lag.

[0067] Note that only decoding to the excitation is required, and not full synthesis of the
speech waveform. This reduces the complexity of the source codec bitstream unpacking
significantly. The codebook gains and fixed codewords are also of interest for the CELP
parameter Direct Space Mapping (DSM) transcoding strategy. If subframe interpolation is
needed, it is done at this point.

[0068] The subframes are now in a form amenable for processing by the destination
parameter mapping and tuning module shown in Figure 14. The short-term LPC filter
coefficients are mapped independently of the excitation CELP parameters. Simple linear
mapping in the LSP pseudo-frequency space can be used to produce the LSP coefficients for
the destination codec. More sophisticated non-linear interpolation can also be used. The
excitation CELP parameters can be mapped in a number of ways giving accordingly better
quality output at the cost of computational complexity. Three such mapping strategies have
been described in this document and are part of the Parameter Mapping & Tuning Strategies
module (Figure 10, block (4)):

• CELP parameter Direct Space Mapping (DSM);

• Analysis in excitation space domain;

• Analysis in filtered excitation space domain.

The selection of the mapping and tuning strategy is through the Mapping & Tuning Strategy
Switching Module (Figure 10, block (3)).
[0069] These three methods are discussed in detail in the following sections. Since the
three methods trade off quality for reduced computational load, they can be used to provide
graceful degradation in quality when the apparatus is overloaded by a large number of
simultaneous channels. Thus the performance of the transcoders can adapt to the
available resources. Alternatively, a transcoding system may be built using one strategy
only, yielding a desired quality and performance. In such a case, the Mapping and Tuning
Strategy Switching module (Figure 10, block (3)) would not be incorporated.

[0070] A voice activity detector (operating in the parameter space) can also be employed at
this point, if applicable to the destination standard, to reduce the outbound bandwidth.

[0071] The outputs of the parameter mapping and tuning module are destination CELP codec
codes. They are packed into destination bitstream frames according to the codec CELP frame
format. The packing process is needed to put the output bits into a format that can be
understood by destination CELP decoders. If the application is for storage, the destination
CELP parameters could be packed or could be stored in an application specific format. The
packing process could also be varied if the frames are to be transported according to a
multimedia protocol, for example if bit scrambling is to be implemented in the packing
process.

[0072] Furthermore, the apparatus of the present invention provides the capability of 
adding future optional signal processing functions or modules. 

Subframe Interpolation

[0073] Subframe interpolation may be needed when subframes for different standards
represent different time durations in the signal domain, or when a different sampling rate is
used. For example, G.723.1 uses frames of 30ms duration (7.5ms per subframe), and GSM-AMR
uses frames of 20ms duration (5ms per subframe). This is shown pictorially in Figure 9.
Subframe interpolation is performed on two different types of parameters: (1) sample-by-
sample parameters (such as excitation and codeword vectors), and (2) subframe parameters
(such as LSP coefficients and pitch lag estimates). The sample-by-sample parameters are
mapped by considering their discrete time index and copying to the appropriate location in
the target subframe. Up- or down-sampling may be required if different sample rates are
used by the different CELP standards. The subframe parameters are interpolated by some
interpolation function to produce a smoothed estimate of the parameters in the target
subframe. A smart interpolation algorithm can improve the voice transcoding, not only in
terms of computational performance, but more importantly in terms of voice quality. A
simple interpolation function is the linear interpolator.
[0074] As an example, Figure 9 shows that three GSM-AMR frames are needed to describe
the same duration of speech signal as two G.723.1 frames. Likewise, three GSM-AMR
subframes are needed for every two G.723.1 subframes. As described above, there are two
types of parameters: subframe-wide parameters (for example, the LSP coefficients) and
sample-by-sample parameters (for example, the adaptive and fixed codewords). Subframe
parameters, denoted $\theta$, are converted linearly, by calculating the weighted sum of
overlapping subframes, and sample-by-sample parameters, denoted $v[\,]$, are formed by
copying the appropriate samples. For interpolation to GSM-AMR subframes from G.723.1
subframes, the analytical formula is as follows:

$$
\begin{aligned}
\theta_i^{\mathrm{AMR}} &= \theta_{\lfloor 2i/3 \rfloor}^{\mathrm{G.723.1}}, & i \bmod 3 &= 0, 2 \\
\theta_i^{\mathrm{AMR}} &= \tfrac{1}{2}\,\theta_{\lfloor 2i/3 \rfloor}^{\mathrm{G.723.1}} + \tfrac{1}{2}\,\theta_{\lfloor 2i/3 \rfloor + 1}^{\mathrm{G.723.1}}, & i \bmod 3 &= 1 \\
v_i^{\mathrm{AMR}}[n] &= v_{\lfloor (40i + n)/60 \rfloor}^{\mathrm{G.723.1}}\big[(40i + n) \bmod 60\big], & &\forall\, i, n
\end{aligned}
$$

where $i = 0$ is the first subframe of the first GSM-AMR frame, $i = 4$ is the first subframe
of the second GSM-AMR frame, etc. Figure 12 depicts this process.
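
The G.723.1 to GSM-AMR subframe interpolation described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the function names `interpolate_subframe_params` and `remap_samples` are hypothetical, an 8 kHz sampling rate is assumed for both codecs (60-sample source subframes, 40-sample destination subframes), and the input is assumed to contain an even number of source subframes so that every straddling destination subframe has both neighbours available.

```python
def interpolate_subframe_params(theta_src):
    """Map per-subframe values from 7.5 ms subframes to 5 ms subframes.

    Destination subframe i lies inside source subframe floor(2i/3) when
    i mod 3 is 0 or 2, and straddles two source subframes equally when
    i mod 3 is 1 (simple linear interpolator).
    """
    n_out = len(theta_src) * 3 // 2
    out = []
    for i in range(n_out):
        j = (2 * i) // 3
        if i % 3 == 1:
            out.append(0.5 * theta_src[j] + 0.5 * theta_src[j + 1])
        else:
            out.append(theta_src[j])
    return out


def remap_samples(v_src, n_src=60, n_dst=40):
    """Copy sample-by-sample data into destination subframes by time index."""
    n_out = len(v_src) * n_src // n_dst
    return [
        [v_src[(n_dst * i + n) // n_src][(n_dst * i + n) % n_src]
         for n in range(n_dst)]
        for i in range(n_out)
    ]
```

For example, the four subframe values of one G.723.1 frame map to six GSM-AMR subframe values, two of which are averages of neighbouring source subframes, while the sample-by-sample data is simply re-cut at 40-sample boundaries.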
[0075] The LSP parameters, which are subframe-wide parameters, should be interpolated in
the pseudo-frequency domain, i.e. $f = \cos^{-1}(q)$. This results in better quality output.
The other subframe parameters do not need to be transformed before interpolating.
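
As a hedged illustration of interpolating LSPs in the pseudo-frequency domain, the sketch below converts two LSP vectors with the inverse cosine, blends them linearly, and converts back. The function name `interpolate_lsp` and the weight parameter `w` are assumptions made for this example, and the LSP values are assumed to be stored as cosines in the range [-1, 1].

```python
import math


def interpolate_lsp(q_a, q_b, w):
    """Blend two LSP vectors in the pseudo-frequency domain.

    q_a, q_b: LSP vectors in the cosine domain (each value in [-1, 1]).
    w: weight of q_b, between 0 and 1 (0.5 gives the simple midpoint).
    """
    f_a = [math.acos(q) for q in q_a]          # to pseudo-frequency
    f_b = [math.acos(q) for q in q_b]
    f = [(1.0 - w) * a + w * b for a, b in zip(f_a, f_b)]
    return [math.cos(x) for x in f]            # back to cosine domain
```

Interpolating the angles rather than the cosines keeps the blended values evenly spaced in frequency, which is the motivation for transforming before interpolating.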

[0076] Note that the above analytical formula is derived from a simple linear interpolator.
The formula can be replaced by any appropriate interpolation scheme, such as spline,
sinusoidal, etc. Furthermore, each CELP parameter (LSP coefficients, lag, pitch gain,
codeword gain, etc.) can use a different interpolation scheme to achieve the best perceptual
quality.

LSP Parameter Mapping and Excitation Vector Calibration by LSP Coefficients

[0077] Although almost all CELP based audio codecs make use of the same approaches to
obtain LPC coefficients, there are still some minor differences. These differences are due to
different window sizes and shapes, different LPC interpolation for each subframe, different
subframe sizes, different LPC quantisation schemes, and different look-up tables.

[0078] In order to further improve the audio transcoding quality produced through the
subframe interpolation method described above, the excitation vectors used as target signals
in transcoding are calibrated by applying LPC data from the source and destination codecs.

[0079] The following two methods can be employed to improve perceptual quality.

Method 1: Linear transform of the LSP Coefficients
[0080] A generic method for converting between LSP coefficients is via a linear transform,

$$q' = Aq + b$$

where $q'$ is the destination LSP vector (in the pseudo-frequency domain), $q$ is the source
(original) LSP vector, $A$ is a linear transform matrix and $b$ is the bias term. In the simplest
case, $A$ reduces to the identity matrix and $b$ reduces to zero. For the embodiment of the
GSM-AMR to G.723.1 transcoder, the DC bias term used in the GSM-AMR codec is
different from the one used by the G.723.1 codec, so the $b$ term in the equation above is
used to compensate for the difference.
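
A minimal sketch of the linear LSP transform follows, written with plain Python lists so that it has no dependencies. The function name `map_lsp` is hypothetical, and any matrix or bias values passed to it are placeholders chosen by the caller; the text does not give the actual values used for any codec pair.

```python
def map_lsp(q, A, b):
    """Return A*q + b for an LSP vector q.

    q: source LSP vector (list of floats, pseudo-frequency domain).
    A: square transform matrix given as a list of rows.
    b: bias vector, e.g. the difference between the DC terms of two codecs.
    """
    return [
        sum(a_ij * q_j for a_ij, q_j in zip(row, q)) + b_i
        for row, b_i in zip(A, b)
    ]
```

With `A` set to the identity matrix and `b` set to zero the function returns the source vector unchanged, which corresponds to the simplest case mentioned above.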

Method 2: Excitation Vector Calibration by LSP Coefficients

[0081] The decoded source excitation vector is synthesized with the source LPC coefficients
in each subframe to convert it to the speech domain, and is then filtered using the quantized
LP parameters of the destination codec to form the target signal in transcoding. This
calibration is optional, and it can significantly improve the perceptual speech quality where
there is a marked difference in the LPC parameters. Figure 13 depicts the excitation
calibration approach.
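
One way to picture the excitation calibration is as a synthesis filter followed by an analysis filter. The sketch below is an assumption-laden illustration, not the patented implementation: it assumes the convention $A(z) = 1 + \sum_k a_k z^{-k}$, interprets "filtered using the destination LP parameters" as inverse (analysis) filtering back to the excitation domain, starts both filters from zero memory, and uses hypothetical function names.

```python
def lpc_synthesize(excitation, a):
    """All-pole filter 1/A(z): excitation -> speech (zero initial state)."""
    mem = [0.0] * len(a)                 # mem[0] holds s[n-1]
    out = []
    for e in excitation:
        s = e - sum(ak * mk for ak, mk in zip(a, mem))
        out.append(s)
        mem = [s] + mem[:-1]
    return out


def lpc_analyze(speech, a):
    """FIR filter A(z): speech -> residual (zero initial state)."""
    mem = [0.0] * len(a)
    out = []
    for s in speech:
        out.append(s + sum(ak * mk for ak, mk in zip(a, mem)))
        mem = [s] + mem[:-1]
    return out


def calibrate_excitation(excitation, a_src, a_dst):
    """Synthesize with the source LPC, then re-analyze with the destination LPC."""
    return lpc_analyze(lpc_synthesize(excitation, a_src), a_dst)
```

When the source and destination coefficients are identical the two filters cancel and the excitation passes through unchanged, which is consistent with the calibration mattering only where the LPC parameters differ markedly.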

Parameter Mapping & Tuning Module 

[0082] This section discusses three strategies for mapping the CELP excitation parameters.
They are presented in order of increasing computational complexity and output quality. The
core of the invention is the fact that the excitation can be mapped directly without the need to
reconstruct the speech signal. This means that significant computation is saved during
closed-loop codebook searches, since the signals do not need to be filtered by the short-term
impulse response, as required by conventional techniques. This mapping works because the
incoming bitstream already contains the optimal excitation according to the source CELP
codec for generating the speech. The invention uses this fact to perform rapid searching in
the excitation domain instead of the speech domain.

[0083] As mentioned previously, having three methods for excitation mapping, each with
successively better performance, allows the transcoders to adapt to the available computation
resources.

CELP Parameters Direct Space Mapping

[0084] This strategy is the simplest transcoding scheme. The mapping is based on
similarities of physical meaning between source and destination parameters, and the
transcoding is performed directly using analytical formulas without any iterating or searching.
The advantage of this scheme is that it does not require a large amount of memory and
consumes almost zero MIPS, but it can still generate intelligible, albeit degraded quality,
sound. Note that the CELP parameters direct space mapping method of the present invention
is different from the prior art apparatus shown in Figure 7. This method is generic and
applies to all kinds of CELP based transcoding, in terms of different frame or subframe sizes
and different CELP codes in source and destination.

Analysis in Excitation Space Domain 

[0085] This strategy is more advanced than the previous one in that both the adaptive and
fixed codebooks are searched, and the gains estimated in the usual way defined by the
destination CELP standard, except that they are done in the excitation domain, not the speech
domain. The pitch contribution is determined first by a local search using the pitch from the
input CELP subframe as the initial estimate. Once found, the pitch contribution is subtracted
from the excitation and the fixed codebook is determined by optimally matching the residual.
The advantage over the tandem approach is that the open-loop pitch estimate does not need to
be calculated from the autocorrelation method used by the CELP standards, but can instead
be determined from the pitch lag of the decoded CELP subframe. Also, the search is
performed in the excitation domain, not the speech domain, so that impulse response filtering
during pitch and codebook searches is not required. This saves a significant amount of
computation without compromising output quality.
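
The local pitch search in the excitation domain can be sketched as follows. This is a simplified, hypothetical illustration: it searches integer lags only, uses a normalized correlation criterion, and repeats the past excitation periodically when the lag is shorter than the subframe. The function name, the search radius, and the default lag limits of 18 and 143 samples are assumptions made for the example; the actual search is defined by the destination standard.

```python
def local_pitch_search(target, past_exc, lag_hint, radius=3,
                       min_lag=18, max_lag=143):
    """Search integer lags near lag_hint directly on the excitation.

    target:   excitation of the current subframe (decoded from the source).
    past_exc: previous excitation, most recent sample last.
    lag_hint: pitch lag decoded from the source bitstream.
    Returns (best_lag, gain), or (None, 0.0) if no candidate correlates.
    """
    n = len(target)
    best_lag, best_gain, best_score = None, 0.0, -1.0
    lo = max(min_lag, lag_hint - radius)
    hi = min(max_lag, lag_hint + radius, len(past_exc))
    for lag in range(lo, hi + 1):
        start = len(past_exc) - lag
        # Candidate codeword: past excitation delayed by `lag` samples.
        cand = [past_exc[start + (k % lag)] for k in range(n)]
        corr = sum(t * c for t, c in zip(target, cand))
        energy = sum(c * c for c in cand)
        if energy <= 0.0 or corr <= 0.0:
            continue
        score = corr * corr / energy       # no impulse-response filtering
        if score > best_score:
            best_lag, best_gain, best_score = lag, corr / energy, score
    return best_lag, best_gain
```

Because the candidates are compared with the decoded excitation itself, no synthesis or weighting filter appears inside the loop, which is where the computational saving over a tandem decode and re-encode comes from.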

Analysis in Filtered Excitation Space Domain 

[0086] In this case, the LP parameters are still mapped directly from the source codec to the
destination codec and the decoded pitch lag is used as the open-loop pitch estimate for the
destination codec. The closed-loop pitch search is still performed in the excitation domain.
However, the fixed-codebook search is performed in a filtered excitation space domain. The
choice of the type of filter, and whether the target vector is converted to this domain for one
or both searches, will depend on the desired quality and complexity requirements.

[0087] Various filters are applicable, including a lowpass filter to smooth irregularities, a
filter that compensates for differences between the characteristics of the excitation in the
source and destination codecs, and a filter which enhances perceptually important signal
features. An advantage is that, unlike the computation of the target signal in standard
encoding, which uses the weighted LP synthesis filter, the parameters of this filter (order,
frequency emphasis/de-emphasis, phase) are completely tunable. Hence, this strategy allows
for tuning to improve the quality for transcoding between a particular pair of codecs, as well
as the provision to trade off quality for reduced complexity.



Silence Frame Transcoding and Generation 

[0088] Some CELP-based standards implement Voice Activity Detectors (VAD) which
allow discontinuous transmission (DTX) and comfort noise generation (CNG) during periods
of no speech. There is a significant bit rate advantage in employing VAD. Transcoding
between these frames is required, as well as generation of silence frames for destination
codecs in the event of silence frames not being generated by the source codec. Usually the
frames consist of parameters for generating suitable comfort noise at the decoder. These
parameters can be transcoded using simple algebraic methods.

Example Embodiments of the Invention 

[0089] The following sections demonstrate embodiments of the invention for the G.723.1
and GSM-AMR speech coding standards. The invention is not limited to these standards. It
covers all CELP-based audio coding standards. Anyone skilled in the art will recognize how
to apply these methods to transcode between other CELP-based coding standards. Before
describing preferred embodiments, a brief description of the GSM-AMR and G.723.1 codecs
is first provided.

GSM-AMR Codec

[0090] The GSM-AMR codec uses eight source codecs with bit-rates of 12.2, 10.2, 7.95,
7.40, 6.70, 5.90, 5.15 and 4.75 kbit/s.

[0091] The codec is based on the code-excited linear predictive (CELP) coding model. A
10th order linear prediction (LP), or short-term, synthesis filter is used. The long-term, or
pitch, synthesis filter is implemented using the so-called adaptive codebook approach.

[0092] In the CELP speech synthesis model, the excitation signal at the input of the short-
term LP synthesis filter is constructed by adding two excitation vectors from adaptive and
fixed (innovative) codebooks. The speech is synthesized by feeding the two properly chosen
vectors from these codebooks through the short-term synthesis filter. The optimum
excitation sequence in a codebook is chosen using an analysis-by-synthesis search procedure
in which the error between the original and synthesized speech is minimized according to a
perceptually weighted distortion measure. The perceptual weighting filter used in the
analysis-by-synthesis search technique uses the unquantized LP parameters.

[0093] The coder operates on speech frames of 20 ms corresponding to 160 samples at the
sampling frequency of 8000 samples/s. At each 160 speech samples, the speech signal is
analysed to extract the parameters of the CELP model (LP filter coefficients, adaptive and
fixed codebooks' indices and gains). These parameters are encoded and transmitted. At the
decoder, these parameters are decoded and speech is synthesized by filtering the
reconstructed excitation signal through the LP synthesis filter.

[0094] LP analysis is performed twice per frame for the 12.2 kbit/s mode and once for the
other modes. For the 12.2 kbit/s mode, the two sets of LP parameters are converted to line
spectrum pairs (LSP) and jointly quantized using split matrix quantization (SMQ) with 38
bits. For the other modes, the single set of LP parameters is converted to line spectrum pairs
(LSP) and vector quantized using split vector quantization (SVQ).

[0095] The speech frame is divided into four subframes of 5 ms each (40 samples). The
adaptive and fixed codebook parameters are transmitted every subframe. The quantized and
unquantized LP parameters or their interpolated versions are used depending on the
subframe. An open-loop pitch lag is estimated in every other subframe (except for the 5.15
and 4.75 kbit/s modes, for which it is done once per frame) based on the perceptually
weighted speech signal.

[0096] Then the following operations are repeated for each subframe:

• The target signal is computed by filtering the LP residual through the
weighted synthesis filter with the initial states of the filters having been
updated by filtering the error between LP residual and excitation (this is
equivalent to the common approach of subtracting the zero input response of
the weighted synthesis filter from the weighted speech signal).

• The impulse response of the weighted synthesis filter is computed.

• Closed-loop pitch analysis is then performed (to find the pitch lag and
gain), using the target and impulse response, by searching around the open-
loop pitch lag. Fractional pitch with 1/6th or 1/3rd of a sample resolution
(depending on the mode) is used.

• The target signal is updated by removing the adaptive codebook
contribution (filtered adaptive codevector), and this new target is used in the
fixed algebraic codebook search (to find the optimum innovation codeword).

• The gains of the adaptive and fixed codebook are scalar quantized with 4
and 5 bits respectively or vector quantized with 6-7 bits (with moving average
(MA) prediction applied to the fixed codebook gain).

• Finally, the filter memories are updated (using the determined excitation
signal) for finding the target signal in the next subframe.

[0097] In each 20 ms speech frame, 95, 103, 118, 134, 148, 159, 204 or 244 bits are
produced, corresponding to a bit-rate of 4.75, 5.15, 5.90, 6.70, 7.40, 7.95, 10.2 or
12.2 kbps.
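
The relationship between bits per frame and bit-rate stated above is simple arithmetic, and can be checked with a short sketch. The names `AMR_FRAME_MS`, `AMR_BITS_PER_FRAME` and `bitrate_kbps` are introduced here purely for illustration.

```python
AMR_FRAME_MS = 20
AMR_BITS_PER_FRAME = [95, 103, 118, 134, 148, 159, 204, 244]


def bitrate_kbps(bits_per_frame, frame_ms=AMR_FRAME_MS):
    """Bits per frame divided by frame duration in ms gives kbit/s."""
    return bits_per_frame / frame_ms


rates = [round(bitrate_kbps(b), 2) for b in AMR_BITS_PER_FRAME]
```

For instance, 95 bits every 20 ms is 95 / 20 = 4.75 kbit/s, and 244 bits every 20 ms is 12.2 kbit/s.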

The G.723.1 Codec

[0098] The G.723.1 coder has two bit rates associated with it, 5.3 and 6.3 kbps. Both rates
are a mandatory part of the encoder and decoder. It is possible to switch between the two
rates on any 30 ms frame boundary.

[0099] The coder is based on the principles of linear prediction analysis-by-synthesis
coding and attempts to minimize a perceptually weighted error signal. The encoder operates
on blocks (frames) of 240 samples each. That is equal to 30 msec at an 8 kHz sampling rate.
Each block is first high pass filtered to remove the DC component and then divided into four
sub-frames of 60 samples each. For every sub-frame, a 10th order linear prediction coder
(LPC) filter is computed using the unprocessed input signal. The LPC filter for the last sub-
frame is quantized using a Predictive Split Vector Quantizer (PSVQ). The unquantized LPC
coefficients are used to construct the short term perceptual weighting filter, which is used to
filter the entire frame and to obtain the perceptually weighted speech signal.
[0100] For every two sub-frames (120 samples), the open loop pitch period, $L_{OL}$, is
computed using the weighted speech signal. This pitch estimation is performed on blocks of
120 samples. The pitch period is searched in the range from 18 to 142 samples.

[0101] From this point the speech is processed on a 60 samples per sub-frame basis.

[0102] Using the estimated pitch period computed previously, a harmonic noise shaping
filter is constructed. The combination of the LPC synthesis filter, the formant perceptual
weighting filter, and the harmonic noise shaping filter is used to create an impulse response.
The impulse response is then used for further computations.

[0103] Using the pitch period estimation, $L_{OL}$, and the impulse response, a closed loop
pitch predictor is computed. A fifth order pitch predictor is used. The pitch period is
computed as a small differential value around the open loop pitch estimate. The contribution
of the pitch predictor is then subtracted from the initial target vector. Both the pitch period
and the differential value are transmitted to the decoder.

[0104] Finally, the non-periodic component of the excitation is approximated. For the high
bit rate, multi-pulse maximum likelihood quantization (MP-MLQ) excitation is used, and for
the low bit rate, an algebraic codebook excitation (ACELP) is used.

First Embodiment - GSM-AMR To G.723.1

[0105] Figure 17 is a block diagram illustrating a transcoder from GSM-AMR to G.723.1
according to a first embodiment of the present invention. The GSM-AMR bitstream consists
of 20ms frames of length from 244 bits (31 bytes) for the highest rate mode, 12.2kbps, to 95
bits (12 bytes) for the lowest rate mode, 4.75kbps. There are eight modes in total.
Each of the eight GSM-AMR operating modes produces different bitstreams. Since a
G.723.1 frame, being 30ms in duration, consists of one and a half GSM-AMR frames, two
GSM-AMR frames are needed to produce a single G.723.1 frame. The next G.723.1 frame
can then be produced on arrival of a third GSM-AMR frame. Thus two G.723.1 frames are
produced for every three GSM-AMR frames processed.
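
The frame timing relationship can be illustrated with a short calculation. The function name `g7231_frames_ready` is hypothetical; the sketch only counts how many complete 30 ms frames are covered by the 20 ms frames received so far.

```python
AMR_FRAME_MS = 20
G7231_FRAME_MS = 30


def g7231_frames_ready(amr_frames_received):
    """Number of complete G.723.1 frames covered by the received GSM-AMR frames."""
    return (amr_frames_received * AMR_FRAME_MS) // G7231_FRAME_MS
```

After one GSM-AMR frame no output frame can be produced, after two frames one is available, and after three frames two are available, matching the two-for-three ratio described above.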

[0106] The 10 LSP parameters used by the short-term filter in the GSM-AMR speech
production model are encoded using the same techniques, but in different bitstream formats
for the different operating modes. The algorithm for reconstructing the LSP parameters is
given in the GSM-AMR standard documentation.

[0107] Once the short-term filter parameters have been generated for each subframe, the
excitation vector needs to be formed by combining the adaptive codeword and the fixed
(algebraic) codeword. The adaptive codeword is constructed using a 60-tap interpolation
filter based on a 1/6th or 1/3rd resolution pitch lag parameter. The fixed codeword is then
constructed as defined by the standard and the excitation formed as,

$$x[n] = g_p\, v[n] + g_c\, c[n]$$

where $x$ is the excitation, $v$ is the interpolated adaptive codeword, $c$ is the fixed codevector,
and $g_p$ and $g_c$ are the adaptive and fixed code gains respectively. This excitation is then
used to update the memory state of the GSM-AMR unpacker, and by the G.723.1 bitstream
packer for mapping.

[0108] The adaptive codeword is found for each subframe by forming a linear combination
of excitation vectors, and finding the optimal match to the target excitation signal, $x[\,]$,
constructed by the GSM-AMR unpacker. The combination is a weighted sum of the previous
excitation at five successive lags. This is best explained via the equation,

$$v[n] = \sum_{j=-2}^{2} \beta_j\, u[n - L + j], \quad 0 \le n \le 59$$

where $v[\,]$ is the reconstructed adaptive codeword, $u[\,]$ is the previous excitation buffer, $L$ is
the (integer) pitch lag between 18 and 143 inclusive (determined from the GSM-AMR
unpacking module), and the $\beta_j$ are lag weighting values which determine the gain and lag
phase. The vector table of $\beta_j$ values is searched to optimize the match between the adaptive
codeword, $v[\,]$, and the excitation vector, $x[\,]$.
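
The weighted sum over five successive lags can be sketched as below. This is an illustrative sketch only: the function name `adaptive_codeword` is hypothetical, the five weights are assumed to be centred on the integer lag, and the periodic repetition used when the lag is shorter than the subframe is an assumption that the text does not spell out.

```python
def adaptive_codeword(past_exc, lag, beta, n_sub=60):
    """Form v[n] as a 5-tap weighted sum of the past excitation around `lag`.

    past_exc: previous excitation, most recent sample last; it must hold
              at least lag + 2 samples.
    beta:     five lag weights, beta[0] for offset -2 up to beta[4] for +2.
    """
    base = len(past_exc) - lag            # index of the sample one lag back

    def tap(k):
        # Offsets before the lag point are read directly; offsets at or
        # beyond it wrap with period `lag` (assumed periodic extension).
        return past_exc[base + k] if k < 0 else past_exc[base + (k % lag)]

    return [
        sum(beta[j + 2] * tap(n + j) for j in range(-2, 3))
        for n in range(n_sub)
    ]
```

In a search, each candidate weight vector from the table would be passed as `beta` and the resulting codeword compared with the target excitation.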

[0109] Once the adaptive codebook component of the excitation is found, this component is
subtracted from the excitation to leave a residual ready for encoding by the fixed codebook.
The residual signal for each subframe is calculated as,

$$x_2[n] = x[n] - v[n], \quad n = 0, \ldots, 59$$

where $x_2[\,]$ is the target for the fixed codebook search, $x[\,]$ is the excitation derived from the
GSM-AMR unpacking, and $v[\,]$ is the (interpolated and scaled) adaptive codeword.

[0110] The fixed codebooks are different for the high and low rate modes of the G.723.1
codec. The high rate uses an MP-MLQ codebook which allows six pulses per subframe for
even subframes, and five pulses per subframe for odd subframes, in any position. The low
rate mode uses an algebraic codebook (ACELP) which allows four pulses per subframe in
restricted locations. Both codebooks use a grid flag to indicate whether the codewords
should be shifted by one position. These codebooks are searched by the methods
defined in the standards, except that the impulse response filter is not used since the search is
being performed in the excitation domain rather than the speech domain.

[0111] The (persistent) memory for the codec needs to be updated on completion of
processing each subframe. This is done by first shifting the previous excitation buffer, $u[\,]$,
by 60 samples (i.e. one subframe), so that the oldest samples are discarded, and then copying
the excitation from the current subframe into the top 60 samples of the buffer,

$$u[n] = u[n + 60], \quad -85 \le n < 0$$

where the index $n$ is set relative to the first sample of the current subframe, and the other
parameters have been defined previously.

[0112] All the mapped parameters are encoded into the outgoing G.723.1 bitstream, and the
system is ready to process the next frame.



Second Embodiment - G.723.1 To GSM-AMR

[0113] Figure 18 is a block diagram illustrating a transcoder from G.723.1 to GSM-AMR
according to a second embodiment of the present invention. The G.723.1 bitstream consists
of frames of length 192 bits (24 bytes) for the high rate (6.3kbps) codec, or 160 bits (20
bytes) for the low rate (5.3kbps) codec. The frames have a very similar structure and differ
only in the fixed codebook parameter representation.

[0114] The 10 LSP parameters used for modeling the short-term vocal tract filter are
encoded in the same way for both high and low rates and can be extracted from bits 2 to 25 of
the G.723.1 frame. Only the LSPs of the fourth subframe are encoded, and interpolation
between frames is used to regenerate the LSPs for the other three subframes. The encoding
uses three lookup tables, and the LSP vector is reconstructed by joining the three sub-vectors
derived from these tables. Each table has 256 vector entries; the first two tables have 3-
element sub-vectors, and the last table has 4-element sub-vectors. Combined, these give a
10-element LSP vector.

[0115] The adaptive codeword is constructed for each subframe by combining previous
excitation vectors. The combination is a weighted sum of the previous excitation at five
successive lags. This is best explained via the equation,

$$v[n] = \sum_{j=-2}^{2} \beta_j\, u[n - L + j], \quad 0 \le n \le 59$$

where $v[\,]$ is the reconstructed adaptive codeword, $u[\,]$ is the previous excitation buffer, $L$ is
the (integer) pitch lag between 18 and 143 inclusive, and the $\beta_j$ are lag weighting values
determined by the pitch gain parameter.

[0116] The lag parameter, $L$, is extracted directly from the bitstream. The first and third
subframes use the full dynamic range of the lag, whereas the second and fourth subframes
encode the lag as an offset from the previous subframe. The lag weighting parameters, $\beta_j$,
are determined by table lookup. As a consequence of the adaptive codeword unpacking, an
approximation to a fractional pitch lag and associated gain can be determined by calculating,

J.-1 
[0117] The fixed codebooks are different for the high and low rate modes of the G.723.1
codec. The high rate mode uses an MP-MLQ codebook which allows six pulses per
subframe for even subframes, and five pulses per subframe for odd subframes, in any
position. The low rate mode uses an algebraic codebook (ACELP) which allows four pulses
per subframe in restricted locations. Both codebooks use a grid flag to indicate whether
the codewords should be shifted by one position. Algorithms for generating the
codewords from the encoded bitstream are given in the G.723.1 standard documentation.
[0118] The (persistent) memory for the codec needs to be updated on completion of
processing each subframe. This is done by first shifting the previous excitation buffer, $u[\,]$,
by 60 samples (i.e. one subframe), so that the oldest samples are discarded, and then copying
the excitation from the current subframe into the top 60 samples of the buffer,

$$u[n] = u[n + 60], \quad -85 \le n < 0$$

where the index $n$ is set relative to the first sample of the current subframe, and the other
parameters have been defined previously.
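
The shift-and-copy update of the excitation memory can be sketched with Python lists. The function name `update_excitation_buffer` is hypothetical and the buffer is assumed to be stored oldest sample first; the 145-sample length used in the usage note is an assumption inferred from the index range in the text (85 retained samples plus one 60-sample subframe).

```python
def update_excitation_buffer(u, x):
    """Discard the oldest len(x) samples of u and append the new subframe x.

    u: previous excitation buffer, oldest sample first.
    x: excitation of the subframe just processed.
    Returns a new buffer of the same length as u.
    """
    if len(x) > len(u):
        raise ValueError("subframe longer than excitation buffer")
    return u[len(x):] + list(x)
```

With a 145-sample buffer and a 60-sample subframe, the newest 85 samples of the old buffer are kept and the current subframe fills the top 60 positions.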

[0119] The GSM-AMR parameter mapping part of the transcoder takes the interpolated
CELP parameters as explained above, and uses them as a basis for searching the GSM-AMR
parameter space. The LSP parameters are simply encoded as received, whilst the other
parameters, namely excitation and pitch lag, are used as estimates for a local search in the
GSM-AMR space. The following figure shows the main operations which need to take place
on each subframe in order to complete the transcoding.

[0120] The adaptive codeword is formed by searching the vector of previous excitations up
to a maximum lag of 143 for a best match with the target excitation. The target excitation is
determined from the interpolated subframes. The previous excitation can be interpolated by
1/6 or 1/3 intervals depending on the mode. The optimal lag is found by searching a small
region about the pitch lag determined from the G.723.1 unpacking module. This region is
searched to find the optimal integer lag, and then refined to determine the fractional part of
the lag. The procedure uses a 24-tap interpolation filter to perform the fractional search. The
first and third subframes are treated differently to the second and fourth. The interpolated
adaptive codeword, $v[\,]$, is then formed as,

$$v[n] = \sum_{i=0}^{9} u[n - L - i]\, b_{60}[t + 6i] + \sum_{i=0}^{9} u[n - L + 1 + i]\, b_{60}[6 - t + 6i]$$

where $u[\,]$ is the previous excitation buffer, $L$ is the (integer) pitch lag, $t$ is the fractional pitch
lag in 1/6th resolution, and $b_{60}$ is the 60-tap interpolation filter.

[0121] The pitch gain is calculated and quantised so that it can be encoded and sent to the
decoder, and also for calculation of the fixed codebook target vector. All modes calculate the
pitch gain in the same way for each subframe,

$$g_p = \frac{x^T v}{v^T v}$$

where $g_p$ is the unquantised pitch gain, $x$ is the target for the adaptive codebook search, and $v$
is the (interpolated) adaptive codeword vector. The 12.2kbps and 7.95kbps modes quantise
the adaptive and fixed codebook gains independently, whereas the other modes use joint
quantisation of the fixed and adaptive gains.

[0122] Once the adaptive codebook component of the excitation is found, this component is
subtracted from the excitation to leave a residual ready for encoding by the fixed codebook.
The residual signal for each subframe is calculated as,

$$x_2[n] = x[n] - \hat{g}_p\, v[n], \quad n = 0, \ldots, 39$$

where $x_2[\,]$ is the target for the fixed codebook search, $x[\,]$ is the target for the adaptive
codebook search, $\hat{g}_p$ is the quantised pitch gain, and $v[\,]$ is the (interpolated) adaptive
codeword.
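
The pitch gain and the fixed codebook target can be computed directly in the excitation domain, as the following sketch shows. The function names are hypothetical, and the gain quantiser is left out: the caller is assumed to pass in whatever quantised gain the destination codec's tables produce.

```python
def pitch_gain(x, v):
    """Unquantised pitch gain: ratio of <x, v> to <v, v>."""
    num = sum(a * b for a, b in zip(x, v))
    den = sum(b * b for b in v)
    return num / den if den > 0.0 else 0.0


def fixed_codebook_target(x, v, g_quantised):
    """Residual left after removing the scaled adaptive codeword."""
    return [a - g_quantised * b for a, b in zip(x, v)]
```

If the target is an exact multiple of the adaptive codeword, the gain equals that multiple and the residual handed to the fixed codebook search is zero.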

[0123] The fixed codebook search is designed to find the best match to the residual signal
after the adaptive codebook component has been removed. This is important for unvoiced
speech and for priming of the adaptive codebook. The codebook search used in transcoding
can be simpler than the one used in the codecs, since a great deal of analysis of the original
speech has already taken place. Also, the signal on which the codebook search is performed
is the reconstructed excitation signal instead of synthesized speech, and therefore already
possesses a structure more amenable to fixed codebook coding.

[0124] The gain for the fixed codebook is quantised using a moving average prediction
based on the energy of the previous four subframes. The correction factor between the actual
and predicted gain is quantised (via table lookup) and sent to the decoder. Exact details are
given in the GSM-AMR standard documentation.

[0125] The (persistent) memory for the codec needs to be updated on completion of
processing each subframe. This is done by first shifting the previous excitation buffer, $u[\,]$,
by 40 samples (i.e. one subframe), so that the oldest samples are discarded, and then copying
the excitation from the current subframe into the top 40 samples of the buffer,

$$u[n] = u[n + 40], \quad -114 \le n < 0$$

where the index $n$ is set relative to the first sample of the current subframe, and the other
parameters have been defined previously.

[0126] While there has been illustrated and described what are presently considered to be
example embodiments of the present invention, it will be understood by those skilled in the
art that various other modifications may be made, and equivalents may be substituted,
without departing from the true scope of the invention. Additionally, many modifications
may be made to adapt a particular situation to the teachings of the present invention without
departing from the central inventive concept described herein.
WHAT IS CLAIMED IS:

1. An apparatus for converting CELP frames from one CELP-based
standard to another CELP based standard, and/or within a single standard but to a different
mode, comprising:
    a bitstream unpacking module for extracting one or more CELP parameters
from a source codec;
    an interpolator module coupled to the bitstream unpacking module, the
interpolator module being adapted to interpolate between different frame sizes, subframe
sizes, and/or sampling rates of the source codec and a destination codec;
    a mapping module coupled to the interpolator module, the mapping module
being adapted to map the one or more CELP parameters from the source codec to one or
more CELP parameters of the destination codec;
    a destination bitstream packing module coupled to the mapping module, the
destination bitstream packing module being adapted to construct at least one destination
output CELP frame based upon at least the one or more CELP parameters from the
destination codec; and
    a controller coupled to at least the destination bitstream packing module, the
mapping module, the interpolator module, and the bitstream unpacking module, the controller
being adapted to oversee operation of one or more of the modules and being adapted to
receive instructions from one or more external applications, the controller being adapted to
provide a status information to one or more of the external applications.

2. The apparatus of claim 1 wherein the controller is a single controller or multiple
controllers.

3. The apparatus of claim 1 wherein the mapping module and the destination bitstream
packing module are within a same module.

4. The apparatus of claim 1 wherein the mapping module is a single module or multiple
modules.

5. The apparatus of claim 1 wherein the interpolation module is a single module or
multiple modules.

6. The apparatus of claim 1, wherein said bitstream unpacking module comprises:
    a bitstream processor, the bitstream processor being adapted to extract information in
    a first format of the one or more CELP parameters in a source CELP codec input frame;
    an LSP decoding module coupled to the bitstream processor, the LSP decoding module
    being adapted to output one or more LSP coefficients using at least the information from
    the source CELP codec input frame;
    a decoding module coupled to the bitstream processor, the decoding module being
    adapted to decode the information to output a pitch lag parameter and a pitch gain
    parameter from the source CELP codec input frame;
    a fixed codebook decoding module coupled to the bitstream processor, the fixed
    codebook decoding module being adapted to decode the information to output a fixed
    codebook vector;
    an adaptive codeword decoding module coupled to the bitstream processor, the
    adaptive codeword decoding module being adapted to decode the information to output an
    adaptive codebook contribution vector; and
    an excitation generator coupled to the fixed codebook decoding module and the
    adaptive codeword decoding module, the excitation generator being adapted to output an
    excitation vector using at least the fixed codebook vector and the adaptive codebook
    vector.

7. The apparatus of claim 1, wherein the interpolator module comprises:
    an LSP process, the LSP process being adapted to convert one or more LSP
    coefficients of a source codec into one or more LSP coefficients of a destination codec
    when said source codec and destination codec have a different subframe size;
    an adaptive codebook process, the adaptive codebook process being adapted to convert
    a pitch lag and a pitch gain from the source codec into a pitch lag and pitch gain of the
    destination codec when said source codec and destination codec have a different subframe
    size; and
    a CELP parameter buffer, the CELP parameter buffer being adapted to hold the one or
    more CELP parameters that need to be buffered for interpolation when the source codec
    and destination codec have a different subframe size.

8. The apparatus of claim 1, wherein the parameter mapping and tuning module
comprises:
    a parameter mapping and tuning strategy switching module, the strategy switching
    module being adapted to select a CELP parameter mapping strategy based upon a plurality
    of strategies; and
    a parameter mapping and tuning strategies module, the mapping and tuning strategies
    module being adapted to output the one or more destination CELP parameters.

9. The apparatus of claim 8 wherein the plurality of strategies comprises:
    a CELP parameter direct space mapping module;
    a filtered excitation space domain analysis module; and
    an analysis in excitation space domain module.

10. The apparatus of claim 8, wherein said parameter mapping and tuning strategies
module comprises:
    an LSP coefficient converter that encodes the destination LSP coefficients; and
    a CELP excitation mapping unit that takes CELP excitation parameters, including
    pitch lag, gain, and excitation vectors, from interpolation to get encoded CELP excitation
    parameters.

11. The apparatus of claim 10, wherein said CELP excitation mapping unit comprises:
    a module of CELP parameters direct space mapping that produces encoded destination
    CELP parameters using an analytical formula without any iteration;
    a module of analysis in excitation space domain mapping that produces encoded
    destination CELP parameters by searching in the excitation space domain; and
    a module of analysis in filtered excitation space domain mapping that produces
    encoded destination CELP parameters by searching the adaptive codebook in closed loop
    in excitation space and the fixed codebook in filtered excitation space.

12. The apparatus of claim 1, wherein said destination bitstream packing module
comprises a plurality of frame packing facilities, each of the facilities being capable of
adapting to a preselected application from a plurality of applications for a selected
destination CELP coder, the selected destination CELP coder being one of a plurality of CELP
coders including the destination CELP coder.

13. The apparatus of claim 1, wherein said controller comprises:
    a control unit which receives external instructions and controls each of the signal
    processing modules; and
    a status unit which sends transcoding information, such as frame counts and an error
    log, to an external application upon request.

14. The apparatus of claim 1, wherein the interpolation module can be selected from
linear interpolation or non-linear interpolation.

15. The apparatus of claim 7, wherein said CELP parameter buffer comprises:
    an excitation vector buffer, the excitation vector buffer being adapted to store the
    reconstructed excitation vector which awaits mapping in the next subframe or frame;
    an LSP coefficient buffer that stores the LSP coefficients, before or after
    interpolation, which await mapping in the next subframe or frame; and
    a buffer for other CELP parameters that stores the pitch lag, pitch gain, codebook
    gain and index, before or after interpolation, which await mapping in the next subframe
    or frame.

16. A method for transcoding a CELP based compressed voice bitstream from source
codec to destination codec, comprising:
    processing a source codec input CELP bitstream to unpack at least one or more CELP
    parameters from the input CELP bitstream;
    interpolating one or more of the plurality of unpacked CELP parameters from a source
    codec format to a destination codec format if a difference exists between one or more of
    a plurality of destination codec parameters including a frame size, a subframe size,
    and/or sampling rate of the destination codec format and one or more of a plurality of
    source codec parameters including a frame size, a subframe size, or sampling rate of the
    source codec format;
    encoding the one or more CELP parameters for the destination codec; and
    processing a destination CELP bitstream by at least packing the one or more CELP
    parameters for the destination codec.



WO 03/058407 PCT/US03/00649

17. The method of claim 16, wherein the processing of the source codec input
comprises:
    converting an input bitstream frame into information associated with one or more
    CELP parameters;
    decoding the information into one or more CELP parameters;
    reconstructing an excitation vector based upon at least the one or more CELP
    parameters; and
    outputting the CELP parameters to an interpolator.

18. The method of claim 16, wherein the interpolating comprises:
    interpolating one or more of the LSP coefficients from the source codec to one or more
    LSP coefficients for the destination codec;
    interpolating CELP parameters other than the LSP coefficients from the source codec
    to other CELP parameters for the destination codec; and
    if the excitation vector does not require a calibration, transferring the source
    excitation vector to the encoding process.

19. The method of claim 18, further comprising:
    converting the one or more LSP coefficients using a linear transform process.

20. The method of claim 18, further comprising:
    converting the source codec excitation vector to a synthesized speech vector by using
    at least one or more of the source decoded LPC coefficients;
    quantising destination LPC coefficients;
    converting the synthesized speech vector back to a calibrated excitation vector by
    using at least the quantised destination LPC coefficients; and
    transferring the calibrated excitation vector to another process.

21. The method of claim 16, wherein the encoding comprises:
    quantising destination LPC coefficients; and
    selecting one of the CELP mapping strategies according to the control signal from the
    parameter mapping and tuning strategy switching module:
        • CELP parameters direct space mapping;
        • analysis in excitation space domain;
        • analysis in filtered excitation space domain.

22. The method of claim 21, wherein operation of said CELP parameters direct space
mapping comprises the operations of:
    encoding the pitch lag from the interpolated pitch lag parameter;
    encoding the pitch gain from the interpolated pitch gain parameter;
    encoding the index of the fixed codebook from analytical forms; and
    encoding the gain of the fixed codebook from the fixed codebook gain parameter.

23. The method of claim 21, wherein operation of analysis in excitation space domain
mapping comprises the operations of:
    selecting the pitch lag from the interpolated pitch lag parameter as an initial value;
    searching pitch lag in closed-loop in excitation space;
    searching pitch gain in excitation space;
    constructing a target signal for the fixed codebook search;
    searching fixed codebook index in excitation space;
    searching fixed codebook gain in excitation space; and
    updating the previous excitation vector.

24. The method of claim 21, wherein operation of analysis in filtered excitation space
domain mapping comprises the operations of:
    selecting the pitch lag from the interpolated pitch lag parameter as an initial value;
    searching pitch lag in closed-loop in excitation space;
    searching pitch gain in excitation space;
    constructing a target signal for the fixed codebook search;
    searching fixed codebook index in filtered excitation space;
    searching fixed codebook gain in filtered excitation space; and
    updating the previous excitation vector.

25. The method of claim 21, wherein said selection is not restricted to the above three
strategies, and a combination of the three strategies can be selected as a new mapping
strategy.

26. As in claim 1, with the addition of a silence frame transcoding unit which can
perform rapid conversion of silence frames from one speech coding standard to another. This
involves mapping the comfort noise parameters.

27. As in claim 1, but where the parameter mapping and tuning module consists of a
voice activity detector for generating silence frames. The voice activity detector makes its
speech/silence determination based on the parameters in the CELP space.

28. As in claim 1, but with the addition of a system for changing the excitation mapping
strategy used, thereby providing a mechanism to adapt to available computational resources
and allow for graceful quality degradation under load.

29. The excitation mapping is performed without going back to the speech signal
domain.

30. A method for processing CELP based compressed voice bitstreams from source
codec to destination codec formats, the method comprising:
    transferring a control signal from a plurality of control signals from an application
    process;
    selecting one CELP mapping strategy from a plurality of different CELP mapping
    strategies based upon at least the control signal from the application; and
    performing a mapping process using the selected CELP mapping strategies to map one
    or more CELP parameters from a source codec format to one or more CELP parameters of a
    destination codec format.

31. The method of claim 30 wherein the plurality of CELP mapping strategies includes:
    CELP parameters direct space mapping; or
    analysis in excitation space domain; or
    analysis in filtered excitation space domain.

32. The method of claim 30 wherein the selecting of the one CELP mapping strategy is
for a predetermined application during a setup process or construction process.

33. The method of claim 30 further comprising receiving the control signal at a
switching module, the switching module being coupled to each of the plurality of mapping
strategies.

34. The method of claim 30 wherein the control signal is provided based upon a
computing resource characteristic of the selected CELP mapping strategy.

35. The method of claim 30 wherein one or more of the plurality of mapping strategies
are provided in a library in memory.

36. The method of claim 31 further comprising encoding the one or more CELP
parameters for the destination codec; and
    processing a destination CELP bitstream by at least packing the one or more CELP
    parameters for the destination codec.

37. The method of claim 36 further comprising transferring the packed destination
CELP bitstream to the destination codec.

38. A system for processing CELP based compressed voice bitstreams from source
codec to destination codec formats, the system comprising:
    one or more codes for receiving a control signal from a plurality of control signals
    from an application process;
    one or more codes for selecting one CELP mapping strategy from a plurality of
    different CELP mapping strategies based upon at least the control signal from the
    application; and
    one or more codes for performing a mapping process using the selected CELP mapping
    strategies to map one or more CELP parameters from a source codec format to one or more
    CELP parameters of a destination codec format.

39. The system of claim 38 wherein the plurality of CELP mapping strategies includes:
    one or more codes directed to CELP parameters direct space mapping; or
    one or more codes directed to analysis in excitation space domain; or
    one or more codes directed to analysis in filtered excitation space domain.

40. The system of claim 38 wherein the selected CELP mapping strategy is for a
predetermined application.

41. The system of claim 38 wherein the one or more codes directed to receiving the
control signal are provided at a strategy switching module, the strategy switching module
being coupled to each of the plurality of mapping strategies.

42. The system of claim 38 wherein the control signal is provided based upon a
computing resource characteristic of the selected CELP mapping strategy.

43. The system of claim 38 wherein one or more codes directed to the plurality of
mapping strategies are provided in a library in memory.

44. The system of claim 43 further comprising one or more codes directed to encoding
the one or more CELP parameters for the destination codec; and
    one or more codes directed to processing a destination CELP bitstream by at least
    packing the one or more CELP parameters for the destination codec.

45. The system of claim 44 further comprising one or more codes directed to
transferring the destination CELP bitstream to the destination codec.

46. The system of claim 44 further comprising one or more codes directed to
transferring the destination CELP bitstream to a storage location.
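
The strategy selection recited in claims 28 and 30 to 35 can be pictured with the following Python sketch. It is an illustration under stated assumptions rather than the claimed implementation: the `Strategy` enumeration, the relative cost figures, the `cpu_budget` control value and the `library` dictionary of mapping routines are all hypothetical names introduced here.

```python
from enum import Enum


class Strategy(Enum):
    DIRECT_SPACE = 1          # CELP parameters direct space mapping
    EXCITATION_SPACE = 2      # analysis in excitation space domain
    FILTERED_EXCITATION = 3   # analysis in filtered excitation space domain


# Hypothetical relative cost of each strategy (higher = more computation).
RELATIVE_COST = {
    Strategy.DIRECT_SPACE: 1,
    Strategy.EXCITATION_SPACE: 2,
    Strategy.FILTERED_EXCITATION: 3,
}


def select_strategy(requested, cpu_budget):
    """Pick the requested strategy, or the costliest one the budget allows."""
    if RELATIVE_COST[requested] <= cpu_budget:
        return requested
    affordable = [s for s in Strategy if RELATIVE_COST[s] <= cpu_budget]
    if not affordable:
        raise ValueError("no mapping strategy fits the available budget")
    return max(affordable, key=RELATIVE_COST.get)


def map_parameters(params, strategy, library):
    """Dispatch to the mapping routine registered for the chosen strategy."""
    return library[strategy](params)
```

Falling back to a cheaper strategy when the budget shrinks is one way to obtain the graceful quality degradation under load mentioned in claim 28.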
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